When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, 2-consistency and Neuroscience Applications

نویسندگان

Hao Zhou

Yilin Zhang

Vamsi Ithapu

Grace Wahba

Vikas Singh

چکیده

Many studies in biomedical and health sciences involve small sample sizes due to logistic or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning to address distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it does not) – independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, both for classical and high dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer’s disease studies, we show empirical results in regimes suggested by our analysis, where pooling a locally acquired dataset with data from an international study improves power.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, 2-consistency and Neuroscience Applications (Supplementary Material)

Remarks on transformations in pre-processing step: For all i ∈ {1, ..., k}, after applying the transformation (shift correction), we pool (Xi, yi) together to estimate β∗. Note that in general the transformation (shift correction) should not depend on the responses yi, otherwise we get a dependence on the noise. To see this, notice that yi = Xiβi + i where Xi is the transformed set of features....

متن کامل

Simultaneous robust estimation of multi-response surfaces in the presence of outliers

A robust approach should be considered when estimating regression coefficients in multi-response problems. Many models are derived from the least squares method. Because the presence of outlier data is unavoidable in most real cases and because the least squares method is sensitive to these types of points, robust regression approaches appear to be a more reliable and suitable method for addres...

متن کامل

Nonparametric Learning in High Dimensions

This thesis develops flexible and principled nonparametric learning algorithms to explore, understand, and predict high dimensional and complex datasets. Such data appear frequently in modern scientific domains and lead to numerous important applications. For example, exploring high dimensional functional magnetic resonance imaging data helps us to better understand brain functionalities; infer...

متن کامل

Integrating Local and Global Error Statistics for Multi-Scale RBF Network Training: An Assessment on Remote Sensing Data

BACKGROUND This study discusses the theoretical underpinnings of a novel multi-scale radial basis function (MSRBF) neural network along with its application to classification and regression tasks in remote sensing. The novelty of the proposed MSRBF network relies on the integration of both local and global error statistics in the node selection process. METHODOLOGY AND PRINCIPAL FINDINGS The ...

متن کامل

Regression spline bivariate probit models: A practical approach to testing for exogeneity

Bivariate probit models can deal with a problem usually known as endogeneity. This issue is likely to arise in observational studies when confounders are unobserved. We are concerned with testing the hypothesis of exogeneity (or absence of endogeneity) when using regression spline recursive and sample selection bivariate probit models. Likelihood ratio and gradient tests are discussed in this c...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2017

When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, 2-consistency and Neuroscience Applications

نویسندگان

چکیده

منابع مشابه

When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, 2-consistency and Neuroscience Applications (Supplementary Material)

Simultaneous robust estimation of multi-response surfaces in the presence of outliers

Nonparametric Learning in High Dimensions

Integrating Local and Global Error Statistics for Multi-Scale RBF Network Training: An Assessment on Remote Sensing Data

Regression spline bivariate probit models: A practical approach to testing for exogeneity

عنوان ژورنال:

اشتراک گذاری